{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1cca1daf-b629-4030-a3dd-fcc2b3c60742",
   "metadata": {},
   "source": [
    "# torchkeras Utility Functions Demo"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2bfd98f-208a-400e-b25f-3b4d23ce89b6",
   "metadata": {},
   "source": [
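    "Besides training PyTorch models in an elegant way, torchkeras also provides a set of practical utility functions for algorithm engineers.\n",
    "\n",
    "These utilities are simple to use: a single line of code is usually enough to handle a need that algorithm engineers run into regularly. Short demos follow.\n",
    "\n",
    "For example:\n",
    "\n",
    "* 1. Download Baidu images by keyword 🔥🔥🔥\n",
    "\n",
    "* 2. Download a GitHub file by URL\n",
    "\n",
    "* 3. Fetch an image from a URL\n",
    "\n",
    "* 4. Chinese characters and minus signs in matplotlib 🔥\n",
    "\n",
    "* 5. Convert a matplotlib figure to a PIL image\n",
    "\n",
    "* 6. Convert text to a PIL image\n",
    "\n",
    "* 7. Send email\n",
    "\n",
    "* 8. Exploratory data analysis (EDA) 🔥🔥\n",
    "\n",
    "* 9. Merge dataset folders\n",
    "\n",
    "* 10. Print in color\n",
    "\n",
    "* 11. Pretty-print a DataFrame\n",
    "\n",
    "* 12. Print logs with a timestamped separator line 🔥\n",
    "\n",
    "* 13. Image analysis and duplicate-image cleaning tool 🔥🔥🔥\n"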
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b75ac1bc-e807-436c-bbf1-3b8417f65b6d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -U torchkeras "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d94172e9-a95b-4adc-8a8e-b2a7b1907bc7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys \n",
    "sys.path.append(\"..\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba4d80fb-c59e-402b-94a1-3694f5ddaa62",
   "metadata": {},
   "source": [
    "## 1. Download Baidu images by keyword"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6718a028-7530-42ea-88cc-f7091c522044",
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchkeras.data import download_baidu_pictures\n",
    "# The keyword '猫咪' means 'kitty'; 100 images are requested and saved to the 'cats' folder\n",
    "download_baidu_pictures(keyword='猫咪', needed_pics_num=100, save_dir='cats')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9da4061b-9053-4abf-97df-e76756f8f487",
   "metadata": {},
   "source": [
    "## 2. Download a GitHub file by URL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "857f7f1c-6c70-4287-8515-30b41c03c916",
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchkeras.data import download_github_file \n",
    "download_github_file('https://github.com/lyhue1991/YOLOv8_tools/blob/main/wandb_callback.py')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "503bd6d0-a8d9-41bd-b761-fe63790d5d0b",
   "metadata": {},
   "source": [
    "## 3. Fetch an image from a URL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25ad0544-8781-4ea5-a90e-4b8cae8a56d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchkeras.data import download_image \n",
    "img = download_image('https://pic1.zhimg.com/v2-10423b9e7bfccf690d7a0d16189029dd_1440w.jpg?source=d16d100b')\n",
    "img "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee5a3d99-5380-4afe-a01f-cac5d8acfb02",
   "metadata": {},
   "source": [
    "## 4. Chinese characters and minus signs in matplotlib"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db20aa3c-43cb-4b07-85ee-a4366f9f9793",
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline \n",
    "import numpy as np \n",
    "from torchkeras import plots \n",
    "plots.set_matplotlib_font(font_size=12) \n",
    "import matplotlib.pyplot as plt \n",
    "\n",
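    "# Negative x values exercise minus-sign rendering on the axis ticks\n",
    "x = np.linspace(-2*np.pi,2*np.pi,1000)\n",
    "y = np.sin(x)\n",
    "plt.plot(x,y)\n",
    "# The Chinese title ('sine curve') checks that CJK glyphs render\n",
    "plt.title('正弦曲线')\n"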
   ]
  },
  {
   "cell_type": "markdown",
   "id": "695f4d90-7604-4e7f-a522-feaa5cfd4b4d",
   "metadata": {},
   "source": [
    "## 5. Convert a matplotlib figure to a PIL image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52f9a49f-ba40-4272-94de-cee6835c3334",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt \n",
    "import numpy as np \n",
    "x = np.linspace(0,2*np.pi,1000)\n",
    "y = np.sin(x)\n",
    "plt.plot(x,y)\n",
    "fig = plt.gcf()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5938580c-5f36-4805-a1b0-85f9ca999aba",
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchkeras.plots import fig2img \n",
    "img = fig2img(fig)\n",
    "img "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89f74cc9-9fbd-4521-94fd-d2b9d9427920",
   "metadata": {},
   "source": [
    "## 6. Convert text to a PIL image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3fcf12b1-6103-4afa-8fba-290dd8d412e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchkeras.utils import text_to_image\n",
    "text_to_image('hello world\\n你好中国！\\n你好北京!')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d521c533-c950-4316-91f4-12ea1bf61970",
   "metadata": {},
   "source": [
    "## 7. Send email"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f7d891b-2560-42e7-ae79-f3db4086457b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchkeras.email import send_msg \n",
    "send_msg(receivers =['745554619@qq.com'],\n",
    "         subject='hello', msg='hello world')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ead3e1f-472b-4c47-97d9-2f97c10be936",
   "metadata": {},
   "source": [
    "## 8. Exploratory data analysis (EDA)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dfefd981-ce9e-4125-b9c7-f242d9a50526",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "import pandas as pd \n",
    "from torchkeras.eda import pipeline \n",
    "\n",
    "\n",
    "breast = datasets.load_breast_cancer()\n",
    "df = pd.DataFrame(breast.data,columns = breast.feature_names)\n",
    "df[\"label\"] = breast.target\n",
    "dftrain,dftest = train_test_split(df,test_size = 0.3)\n",
    "dfeda = pipeline(dftrain,dftest)\n",
    "dfeda "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "959dc0c7-1893-49bd-8008-d52fcd2965a3",
   "metadata": {},
   "source": [
    "## 9. Merge dataset folders"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5badf70f-d426-428e-a67b-138503dde177",
   "metadata": {},
   "source": [
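    "Datasets for image tasks are usually organized as folders, for example in YOLO format. New data is often produced incrementally over time.\n",
    "\n",
    "Is there a quick and convenient way to merge a new dataset folder with an existing one?"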
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "caeac889-a418-43e5-bc1b-36b36fff386c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "# Build two toy YOLO-style dataset folders filled with empty placeholder files\n",
    "for folder in ['ds1','ds2']:\n",
    "    for tp in ['images','labels']:\n",
    "        # Images get a .jpeg suffix, labels get a .txt suffix\n",
    "        suffix = '.jpeg' if tp=='images' else '.txt'\n",
    "        for part in ['train','val']:\n",
    "            path = Path(folder)/tp/part\n",
    "            path.mkdir(parents=True, exist_ok=True)\n",
    "            for i in range(3):\n",
    "                (path/f'{i}{suffix}').touch()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6abe50bd-dab4-411f-83ba-20441e6a7459",
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchkeras.data import merge_dataset_folders \n",
    "from_folders = ['ds1','ds2']\n",
    "to_folder = 'ds_merge'\n",
    "merge_dataset_folders(from_folders,to_folder)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eca6980c-5bb7-4953-88b2-4adac83bd31d",
   "metadata": {},
   "source": [
    "## 10. Print in color"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15343a4e-c22a-4827-a6fc-82ee87914eb1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchkeras.utils import colorful \n",
    "print(colorful('helloworld'))\n",
    "print(colorful('helloworld',color='blue'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53383895-38af-4be0-b3fb-8f2b28bbbd3d",
   "metadata": {},
   "source": [
    "## 11. Pretty-print a DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b17d55f-1709-4435-80ca-db3888d295f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_diabetes\n",
    "ds = load_diabetes(as_frame=True)\n",
    "df = ds['data'].copy()\n",
    "df['target'] = ds['target']\n",
    "df['text'] = 'hello\\t 你好中国\\n 你好 北京'\n",
    "\n",
    "from torchkeras.utils import prettydf \n",
    "prettydf(df,nrows=10,ncols=10);\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a82be506-be03-41be-8dd6-3eabb93dde33",
   "metadata": {},
   "source": [
    "## 12. Print logs with a timestamped separator line"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "78c13734-b11c-4242-840b-2fd1fb3e9cc7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchkeras.utils import printlog \n",
    "printlog('step1: reading data...')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9e52f96-c506-4f80-9fbf-a2930356db6c",
   "metadata": {},
   "source": [
    "## 13. Image analysis and duplicate-image cleaning tool"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80bd3f4b-6d94-48ba-ba1c-9adc676cb57f",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install fastdup "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6b0718a-d351-43ab-8afb-b1ab5459022e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchkeras.data import download_baidu_pictures\n",
    "# Request a larger batch of images (keyword '猫咪' means 'kitty') to analyze and clean\n",
    "download_baidu_pictures(keyword='猫咪', needed_pics_num=500, save_dir='cats')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73028844-2710-426a-ac56-4a15b4a3f0f2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from torchkeras.data import ImageCleaner\n",
    "cleaner = ImageCleaner(img_files = 'cats')\n",
    "cleaner.run_summary(duplicate_similirity=0.99, outlier_percentile=0.02)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f76ddc1-3fb4-4e48-ab0d-44ec0bef683c",
   "metadata": {},
   "outputs": [],
   "source": [
    "dfduplicates = cleaner.get_duplicates() \n",
    "dfduplicates "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "08529dc6-1c8b-44f0-b3e8-931ec613e36f",
   "metadata": {},
   "outputs": [],
   "source": [
    "dfstats = cleaner.get_stats()\n",
    "dfstats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "662a3334-f5ab-4d3f-87e4-b68eefdc8eb7",
   "metadata": {},
   "outputs": [],
   "source": [
    "cleaner.vis_duplicates() "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b31b4ff1-23a8-42a1-aa60-a2d38a1f51d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "cleaner.delete_duplicates() "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
